Traffic Handler patent summary



1. Summary of proposals




The document includes proposals for the following inventions:


State Element - A smart memory cell for serialising accesses to shared state variables

State Engine - A formal framework for designing an active state storage system using state elements

Programmable orderlist manager - A system for maintaining ordered logical data structures in software at high speed

Overlapped Virtual Queueing - A low overhead method for setting up and tearing down virtual queues



Notes: 



Figure and reference indexes apply locally within each chapter of this document 

Propel = a m*ay d ^ — (Actions and claims) relating to each 
2. A programmable 40 Gbits/s Traffic Handler



2.1 Background 

A basic knowledge of the function and anatomy of an Internet router is assumed.

i^S^^T^^^lTTSlT ~ °" i-* * 

2.2 The problem and prior art

• Queue management assigns buffer space to queues and prevents overflow 

• Congestion avoidance is implemented to cause traffic sources to slow their transmission rates if queues become congested

• Scheduling regulates the dequeueing process by dividing the available output line bandwidth between the
queues. Different service levels are provided by weighting the amount of bandwidth and buffer space allocated to different queues

In reality, system realisation is confounded by some difficult implementation issues:
• The processing overhead of some scheduling and congestion avoidance algorithms is high.
• Priority queue ordering for some (FQ) scheduling algorithms is a non-trivial problem at high speeds

queues implemented X access is requ.red. The volume of state increases with the number of 

" %£SSg^^ is a ™*» To And a flexible 0** 

■n.o an appropriate 

scheduler determines the order o dequeued ^ She- 1-^, rf Q ? * '"'u ° UtpUt stream - The trafflc 
scheduling problem Is thus staXd E a "TwlLnH P ? QUeUe l ° 9 f °" OWin9 schedu "n9 «age. The 

2.3 Summary of the invention 

A traffic handler architecture in which packets are processed by software and inserted into an orderlist for scheduling.

2.4 List of attached figures 



Figure 2.1 Traffic Handler system functional overview showing principal components (blocks: processing system, orderlist management system, packet record processing, packet handling, packet memory)
Figure 2.2 Functional overview of the system of MTAPs and other ASIC cores in the Q-chip
Figure 2.3 Traffic Handling system implementation (labels: packets from switch, 40 Gb/s streaming interface with overspeed, data structure memory (2 channel DDR SRAM), packet storage (8 channel DDR FCRAM or Rambus))
2.5 Detailed description

Description of the concept and invention

• There sre nc separate, physics; sisgsf Input quew 

mine whether or not a newly arrived packet shnuirt £ ® , q .t Ue etc> 0ccu P a ncy can be used to deter- 
d (congestion managem™?) Rnlsh number^ are ufed to ™ 0 ^ Ut .S U8U ;- or whether it should be 

sssas queue and det Jine an ™ 

•rr^w^s 

' P^^^ data How processor which can 

The MTAP processor is ideal for this purpose, providing a large number of processing cycles per packet
for packets arriving at rates as high as one every couple of system clock cycles.

Details of the embodiment 

rirr eMrAp ^ 

altn^^ 

Figure 2 shows a functional decomposition of the MTAP processing system 
Sd^eTref^ 

Figure 3 shows a full traffic handler implementation using the Q-chip architecture

P^B^nrsUr^" 1 ^ 0 ^ 

Additional related design work 

of different classes of service which may be stored "andw.dth. The table size also limits the number 

t^rma^ 

available from a relatively small volume ofmemor? d ' fferent C ' 8SS ° f Serv,ce oomblnalions 
• PEs can perform proxy lookups on behalf of each other. A single CoS table
can therefore be split across two PEs, thus halving the memory requirement.

In the context of the ordinary use of MTAP processors this is not necessarily innovative. As a tool being used to
improve function and performance in a Traffic Handling system this gives a real advantage over the shared
memory approach.

Reference material

[1] A. Spencer "Traffic Management Whitepaper"

- Background information on Traffic Management

[2] S. Keshav "An Engineering Approach to Computer Networking", Addison-Wesley, 1997

- Scheduling, congestion avoidance and hierarchical link sharing theory

[3] A. Spencer "Per-Flow Traffic Handler"

- Original design work for ClearSpeed Traffic Handling solution

2.6 Key features of the invention 

• Conventional traffic handling involves parallel enqueueing and then serialised scheduling from those queues.

For high performance traffic handling we have turned this around. Arriving packets are first processed
in parallel and subsequently enqueued in a serial orderlist. This is referred to as "Think First Queue Later" (TF-QL).

• The use of a pipeline parallel processing architecture (the ClearSpeed MTAP processor) is
innovative in a Traffic Handling application. It provides the wire speed processing capability which is
essential for the implementation of this concept.

• An alternative form of parallelism (compared with independent parallel schedulers) is thus exploited in order to solve the
processing issues in high speed Traffic Handling.
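The "Think First Queue Later" ordering described above can be illustrated with a small sketch. It is not taken from the design documents: the flow names, the weights and the simplified finish number (packet length divided by a per-flow weight) are assumptions chosen only to show per-packet processing happening in parallel before a single serial insertion into an ordered list.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass(order=True)
class PacketRecord:
    finish_number: float                 # sort key used by the orderlist
    seq: int                             # arrival order, breaks ties
    flow: str = field(compare=False)
    length: int = field(compare=False)


# Hypothetical per-flow weights; a larger weight means a larger bandwidth share.
WEIGHTS = {"gold": 4.0, "bronze": 1.0}


def think(arrival):
    """Per-packet processing; independent of every other packet."""
    seq, flow, length = arrival
    # Simplified stand-in for a fair queueing finish number.
    return PacketRecord(length / WEIGHTS[flow], seq, flow, length)


def tf_ql(arrivals):
    # "Think first": records are computed in parallel.
    with ThreadPoolExecutor(max_workers=4) as pool:
        records = list(pool.map(think, arrivals))
    # "Queue later": one serial insertion point into the ordered list.
    orderlist = []
    for record in records:
        heapq.heappush(orderlist, record)
    return [heapq.heappop(orderlist) for _ in range(len(orderlist))]


arrivals = [(0, "bronze", 100), (1, "gold", 100), (2, "gold", 400), (3, "bronze", 50)]
schedule = [(r.seq, r.flow) for r in tf_ql(arrivals)]
```

A real scheduler would derive finish numbers from per-flow history rather than from packet length alone; the sketch only shows where the parallel and serial stages sit relative to one another.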

2.7 Scope of claim 

The claim applies most specifically to the use of MTAP processors in a Traffic Handling device used for network

The claim may be broadened beyond the specific use of MTAPs to cover the more general
TF-QL concept and the implementation of the orderlist.
3. A packet storage system for traffic handling
3.1 Background 

A basic understanding of router anatomy and traffic handling is assumed.

i-teS^ J* which exceeds the output iine rate. Temporary 

Using logical queues, different grades of service can be achieved by varying the allocation of the
available resources (line bandwidth and memory capacity) amongst the queues.

3.2 The problem and prior art 

ISSSSSStSSi ha/bSn Se 9 d h to C ° n K°! ^ thiS ProP ° SaL What ' d ° — is *- P— cueing 
arena. SlnJcJlgSgl tSSZ I^^SiS^ ^ the N *™* 

oran original combination of ideasdes^ 
'-^°»°-ng^ 

3. Memory capacity must be high, as buffers will fill up rapidly during transient bursts at high line rates
hardware or software device wJllTp^ 

f^S!^^ -signed memory can meet (2, and 

TdS^ 

,s S ,nattem pU ngtomeet (1 ,so,ution S ^ 

In summary, it is very difficult to design architectures that meet all four criteria.
3.3 Summary of the invention

A Memory Hub, used for buffering packets in a high line rate Traffic Handler.
3.4 List of attached figures



Figure 3.1 Functional overview of the components of the packet storage system (labels: packet records, Arrivals)
Figure 3.3 Datagram Retrieval Unit design (labels: Datagram Retrieval Unit, output buffer)




Figure (caption not recoverable): Memory Hub system implementations (a) and (b). Labels: streaming interface with overspeed (40 Gb/s), 32 Gb/s link to the processing system, highly integrated Memory Hub (1400 pins), packet storage (4 channels DDR FCRAM or Rambus per Hub)



3.5 Detailed description 

Description of the concept and invention 

M =s ^m^ o„ plina . Flrst , isolate the prob(em so tha< (t (a not entgng|ed |nterdependent wfth Qther 

' and dequeue.no of packets. The 

not passed around the ^^W^^^^^X^ 9 ^ P ack ^ themselves are 
ets can be p.aced ,n memory and be ^S^SfsS ^Z^T^f^T' 



' X'Smo^Sr ™"»"™« ~ «- *~i«*d torn ft. ...k «p a d,« b* 

• Packet memory divided into small blocks in the memory address space. The blocks mav be one of o 

appropriate). Each block of a given packet points to the next in linked list fashion. ( 
' SUiSSSf" 1 h6re tHat) Pa ° ketS d ° n0t P ° ,nt t0 ° ne another - ,n other woras ' there ,s •» logical queue 
' mt?ry^ 

* 2d^ffi?^^K^ b ^^ n,n 8 the bitma P. ° r more Erectly from the stream of 
released I If Hie ^ZT^^J f * » . ^ from memory and the memory blocks they occupy are 
the Sap P b ** r6tUmed addresses must "e buffered and their information Inserted into 

• Placing information into memory will normally require the overhead of updating state
information. A means of smoothly streaming data into memory at 80G is
required which does not get held up by intermediate state manipulation.

• The device that receives packets from the switch fabric is referred to as the Arrivals block Thr* nin»v, n ~* 
LrnZl/ cnuriKs wnicn can be mapped Into memory blocks. An appropriate fraament size rand thn* 

ZFZle oSt »SX far 88 f h PaCk6 i b ^ ed ° n 118 lan 9* The P p P aoket is forXdeS to me Memory 
and tne packet record t0 the system use for QoS processing and logical queuelng 

from ITlocifjiol P ' feSt SySt6m Wh,Ch reqUires no state manipulation other than to pop Items 

• An advantage of this scheme is that it supports load balancing across the available memory channels.
When the local pool is replenished, it receives the addresses of memory blocks
mapped into different physical memory channels in equal quantities. Consecutive blocks can
thus be written to memory blocks in round robin fashion, efficiently spreading the address and data bandwidth load.

• If the local pool becomes empty, packets may need to be partially or fully dropped.
Handling this relies on functions the QoS processor already possesses for normal operation.
SSSSSB!^^ - * 80G performance requlre-

tion more complex. The main S£ 2 tha^ a SkX sto^dln M^f the 4 ° G paCket """^ft-no- 

"-.one^must.erea.anome.nex.p^K^ 

* 3S£tt^^ » «n be viewed as e type of DMA 

data (packet, from memory flKSSKS, 5fi£S Kta to SS ^ te *** 

' oTSl^ g a e%3mtrf^ 

The block size is chosen (a) to suit the distribution of packet sizes (fairly obvious), and (b) to make retrieval
of packets from memory more efficient (less obvious). When a packet is stored in multiple memory blocks, a request
is issued and the first data (which holds the address of the next block) must return before the next request can be
issued. If the block is large enough then the next pointer can be extracted from the first few bytes of a block
and the next request issued while the remainder of the block is still being read from memory.

" "e stored in a single memory b.ock. 

(DRU) in the Memory Hub ZmSSS^^^T be fe, ° hed ° y 8 Data 3"»" R««neval Untt 
to a previous request has returned) 1 P 1 request may be sent before «» response 

* Srs imp , ,em r n9 a h,erarchicai «»«** 

packets from the Datagram Rew£a?Gnlt (DRU) neDRuS^ Dis Pl^ equests 

read controllers of Individual memory channel and r^Ulf /"fnnory blocks from the memory 

single blocks. Packets which are stored i« ,™Zu%J?£ 5 Tn,acan be done directly for packets stored In 
block" pointers can be recovered P b,OCK8 mUS ' be read from memof y so the 'next 

maximises the numbe -of : P ins availabte fo 'mem Q « 3 ° b ° ,m P. temented a * ( a ) *e hub package 

connected to a single Qo8 en? See K? ""P 1 *™"^ 0 ". a " d ( b > muWpto hubs may be 

reducing the overall pincount -d^T^ 
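The storage scheme described above (a pool of free block addresses spread across memory channels, packets stored as linked lists of blocks, and blocks released on retrieval) can be sketched as follows. This is an illustrative model only: the class name, the channel count and the use of a simple queue for the free pool are assumptions, and the 512 byte block size is taken from the embodiment notes.

```python
from collections import deque

BLOCK_SIZE = 512      # bytes per memory block
NUM_CHANNELS = 4      # hypothetical number of physical memory channels


class PacketMemory:
    def __init__(self, blocks_per_channel):
        # Free addresses are interleaved so consecutive pops rotate over channels.
        self.free = deque(
            (channel, index)
            for index in range(blocks_per_channel)
            for channel in range(NUM_CHANNELS)
        )
        self.blocks = {}  # address -> (payload chunk, address of next block or None)

    def store(self, packet):
        """Write a packet as a linked list of blocks; return the head address."""
        chunks = [packet[i:i + BLOCK_SIZE] for i in range(0, len(packet), BLOCK_SIZE)] or [b""]
        if len(chunks) > len(self.free):
            return None   # pool exhausted: the caller must drop the packet
        addresses = [self.free.popleft() for _ in chunks]
        for i, (address, chunk) in enumerate(zip(addresses, chunks)):
            next_address = addresses[i + 1] if i + 1 < len(addresses) else None
            self.blocks[address] = (chunk, next_address)
        return addresses[0]

    def retrieve(self, head):
        """Follow the next pointers, release each block, and return the packet."""
        data, address = b"", head
        while address is not None:
            chunk, next_address = self.blocks.pop(address)
            self.free.append(address)   # released address goes back to the pool
            data += chunk
            address = next_address
        return data


mem = PacketMemory(blocks_per_channel=2)       # 8 blocks in total
packet = bytes(range(256)) * 5                 # 1280 bytes, needs 3 blocks
head = mem.store(packet)
used = sorted(mem.blocks)
restored = mem.retrieve(head)
```

Note that only blocks of the same packet point to one another; the model holds no queue structure linking packets, which matches the separation of packet storage from logical queueing.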
Details of the embodiment 

a R M e em 0 r? SST** ^ ^ de8l ' Sn d ° CUment [1J f0r de,a «-" ■» microarchitectural design of a 

- P— • ^ey isolate 

packets from memoiy on request. d fr0m 0,6 h ' gh speed llnk ,0 memof y. or retrieves 

ssm t^^stsfssss^ tsr ^n** The bus ac,s as a — ~ 
ised as 512 byte memory blocks.

Ssof h St^ ° f the ° RU b '° Ck - Tnecontroilersupervises read paining orthe 

nS/n^h **" ?%2%t 8yS !f m lm P ieme ntaiIons. 8 channels of memory provide approximately 20 GBytes/s 
da a bandwidth , and 300 M random accesses per second. This meets requirement for 40G traffic SSSaS 
SSiS? ° Verapeed J V T h " e ' s ,s conceivable that the whole system could 

doing S tSK™* - 40 diStribUte ^ fUnCHOn ° Ver mU ' Up,e deVfces » There' are many good reto^S 
' »a»^^ 

* £%Sf£2SZi^ =o d .o T g h y e COmb,nat '° n ° f hf9h P0W6r ahd ■*"»»■ a s "* la d -ce 
Additional related design work 

The related design work describes in detail the Arrivals and Dispatch blocks which are instanced in a Streaming
Hub. Whether these blocks contain intellectual property which should be protected is unclear.

mXX^ 

Arrivals block - implements pipelined processing techniques in order to perform rapid, on-the-fly record
extraction. Flags in the record are set to indicate whether a packet is intact
or was dropped in Arrivals (dependent on the status of the packet buffer). This enables the
processor to perform appropriate housekeeping and error reporting.

Dispatch block - is a DMA engine which reads a stream of records and retrieves the corresponding packets from
memory. Its buffer occupancy provides a servo signal to the memory hub to control the rate at which the Hub delivers packet data.



Reference material 

[1] A. Spencer "Per-Flow Traffic Handler" 

- Original design work for ClearSpeed Traffic Handling solution 



3.6 Key features of the invention 
' ^^T h ^^X^%^ ni are - Manifested by the packet/packet

' ttSKSf! " (Remote fanout ,o * me ™* *» 

tlon limits - addresses (1) and (2)) requirement without meeting implementa- 

3.7 Scope of the claim 

™cHandi,n g ,scharW^^ 

1ST. l™t g TnZT ^ be aPP " ed WHereV9r ( ' atenCy t0 ' eran,) h,sh "~ '« required 
4. The state element
4.1 Background 

Situations often arise whereby a function must be performed on a continuous stream of data. If the function is
implemented in software on a processor, then each datagram (packet of data) which arrives in sequence from the
stream must be stored, processed and then forwarded. This process will take some finite quantity of time to
execute. As the rate of packet arrival increases there will come a point at which a single processor can no longer
keep up. The function must then either be distributed across multiple processors arranged in a pipeline, or across
multiple processors arranged in parallel - each receiving a packet from the stream in turn in some round robin
sequence. Packets output from parallel processors are typically reordered before forwarding.

This is fine, as long as there is no interdependence between the processors. They operate independently of one
another, perhaps sharing a common code or data store to which all have read only access.

4.2 The problem and prior art 

A problem arises when such processors share state variables for which both read and write access is required.
Processors cannot be permitted to simultaneously read/modify/writeback a shared variable, as the result from the
first will be overwritten by the second. It is necessary to serialise the accesses. This raises two significant issues:

1. A system for interlocking processors together must be implemented so that they may arbitrate for a
resource and then lock it when there is contention. This control signalling can be complex and add
significant functional and performance overhead.

2. When a processor has successfully negotiated for a resource, it should use that resource and then release
it as soon as possible to limit the delay imposed on other processors. If access latencies to external
memories are long, this can impact heavily on system performance.

Semaphores can be used to interlock processors, or control logic and caches can be used to intercept concurrent
accesses and serialise them; however, these can be complex, slow and/or require significant support tied into
hardware. Embedded memory can reduce lock out time, but delays can still be significant.

4.3 Summary of the invention 

A smart memory cell for serialising accesses to shared state variables. 
4.4 List of attached figures




Figure 4.1 Illustrating the advantage of the state element concept: (a) conventional approach, (b) using state elements.
Figure 4.2 Functional overview of the state element (labels: single state element, microcoded control, Addr, RW)




Figure 4.3 Implementation overview of the state element 



Company Confidential

Document Number VRev
Confidentiality Level: RED © Copyright ClearSpeed Technology Ltd 2001





Figure 4.4 Implementation examples of the state and command units 



4.5 Detailed description 

State elements are the key components which perform the serialisation of accesses into a shared memory. This
patent describes only the state element. In a real system state elements must be combined in state engines and
connected to the bus. The innovative arrangement of state elements into larger state engines (which can be
connected to a system bus) is covered by a sister patent - "The State Engine".

Description of the concept and invention 

Instead of getting parallel processors to read from memory, modify and write back data, get them to request that the
memory performs the modification on their behalf. In other words, position the serialisation point not within/between
each processor, but in a simple shared processor which has local and rapid access to the memory in which the
shared state variables are stored.



The state element concept and the advantages it brings are illustrated in figure 1.



The state element is analogous to an object in OOD. It has privately stored data which is accessible only via the
object's methods. By issuing commands, parallel processors could be considered to be making function calls to
the object.
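As an informal illustration of moving the serialisation point to the memory side, the sketch below has several "processor" threads issue increment requests to a single owner of the shared state instead of reading, modifying and writing it themselves. The names (SmartMemory, request_add, queue_depth) are hypothetical, and a software queue stands in for the command path.

```python
import queue
import threading


class SmartMemory:
    """Owns the shared variables; only its own worker thread touches them."""

    def __init__(self):
        self.state = {}
        self.commands = queue.Queue()
        self.worker = threading.Thread(target=self._serve, daemon=True)
        self.worker.start()

    def _serve(self):
        while True:
            item = self.commands.get()
            if item is None:          # shutdown marker
                break
            addr, delta = item
            # The read/modify/writeback happens here and nowhere else.
            self.state[addr] = self.state.get(addr, 0) + delta

    def request_add(self, addr, delta):
        # The processor neither reads nor locks the variable; it only sends a command.
        self.commands.put((addr, delta))

    def shutdown(self):
        self.commands.put(None)
        self.worker.join()


def processor(mem, count):
    for _ in range(count):
        mem.request_add("queue_depth", 1)


mem = SmartMemory()
processors = [threading.Thread(target=processor, args=(mem, 1000)) for _ in range(8)]
for p in processors:
    p.start()
for p in processors:
    p.join()
mem.shutdown()
total = mem.state["queue_depth"]
```

No update is lost even though the eight threads never negotiate with one another, because every modification is applied in arrival order at one place.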

A state element comprises a small block of embedded memory with single cycle read/write access time combined
with a simple arithmetic and logic unit. The Arithmetic Unit receives commands (from processors) which comprise
an address, data and a command code. The address identifies the state variable which is to be accessed, the data
provides operands which the arithmetic unit uses to modify the variable, and the command selects a locally stored
thread of programmed microcode which is able to read, modify and writeback the state variable within a very small
number of system clock cycles. The result can be returned to the processor which issued each command.

These components are shown in figure 2. 
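A minimal software model of the command format may help to fix ideas. It is a sketch under assumptions: the command names and the table of routines below are invented for illustration and stand in for the locally stored microcode threads; they are not the actual command set.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Command:
    code: str    # selects a stored routine
    addr: int    # selects the state variable
    data: int    # operand supplied by the processor


# Each routine maps (old value, operand) to (new value, result returned to the processor).
Routine = Callable[[int, int], Tuple[int, int]]

MICROCODE: Dict[str, Routine] = {
    "READ": lambda old, d: (old, old),
    "WRITE": lambda old, d: (d, d),
    "ADD": lambda old, d: (old + d, old + d),
    # Conditional access: subtract only if the value would stay non-negative.
    "DEC_IF": lambda old, d: (old - d, 1) if old >= d else (old, 0),
}


class StateElement:
    def __init__(self, size: int):
        self._mem: List[int] = [0] * size    # private embedded memory

    def execute(self, cmd: Command) -> int:
        old = self._mem[cmd.addr]
        new, result = MICROCODE[cmd.code](old, cmd.data)
        self._mem[cmd.addr] = new
        return result


element = StateElement(16)
results = [
    element.execute(c)
    for c in (
        Command("WRITE", 3, 10),
        Command("ADD", 3, 5),
        Command("DEC_IF", 3, 20),   # refused: 15 < 20
        Command("DEC_IF", 3, 15),   # accepted: value becomes 0
        Command("READ", 3, 0),
    )
]
```

Because the memory is reachable only through execute, the model also reflects the object analogy: the stored data is private and the commands play the role of methods.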
Details of the embodiment 

Refer to the State Engine design framework document [1] for full details of the microarchitectural design and
implementation of a State Element.

A smart memory element comprises an embedded memory and an attached function - the function could either
be hardwired (a finite state machine), or a programmable, microcoded circuit. The latter approach is the more
versatile and complex, and receives further attention in this document.

A more complete picture of the system of component modules and their interconnection is shown in figure 3. Note 
the presence of special function and condition blocks. These greatly extend the functional capability of the element 
(as described in ref [1]). 

The emphasis in state element design is on rapid memory access speed, not processing capability. Embedded
memory blocks are small enough that single cycle access time is achievable. Configurable R/M/W is possible
within a two cycle period, as it is possible to perform a simple arithmetic operation on the result of a read and
have it turned around for writeback within the second cycle. Typically, a command could be fully processed within
3 to 5 clock cycles. Figure 4 illustrates the simplicity of the arithmetic unit, and how the path between the
command line and the memory has minimal delay. The lower diagram shows a more complex variant in which
multiple items of state are held in memory. The impact on the command line turnaround (and microcode store size)
is significant. (However, this is not to say that the lower circuit should not be used. In a lower performance system
with a more complex set of state it could be the preferred approach.)

Additional related design work 

Reference [1] also covers some algorithmic techniques used in conjunction with state elements. 

System threads: - Background system threads could be programmed to operate on the data in the state memory
when commands from processors are not being serviced. For instance, they could be useful for identifying state entries
which are idle.

Find free queue algorithm: - Find_free_queue system function. This is a background thread which implements a
"Two strikes and out" algorithm for de-assigning state entries used to represent/manage queues which go idle (i.e.
empty).
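The source names the "Two strikes and out" algorithm without spelling out its steps, so the following is one plausible reading, stated as an assumption: a background sweep marks a queue entry the first time it is seen empty and releases it only if it is still empty on the next sweep. All field and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QueueEntry:
    assigned: bool = False
    occupancy: int = 0
    strike: bool = False   # set when the entry was seen idle on the previous sweep


def background_pass(entries, free_pool):
    """One sweep of the (assumed) background thread over the state memory."""
    for queue_id, entry in enumerate(entries):
        if not entry.assigned:
            continue
        if entry.occupancy > 0:
            entry.strike = False          # busy again: forgive the earlier strike
        elif entry.strike:
            entry.assigned = False        # second consecutive idle sighting: release
            entry.strike = False
            free_pool.append(queue_id)
        else:
            entry.strike = True           # first idle sighting


entries = [QueueEntry(True, 0), QueueEntry(True, 2), QueueEntry(True, 0)]
free_pool = []
background_pass(entries, free_pool)   # queues 0 and 2 each receive a strike
entries[2].occupancy = 1              # queue 2 receives traffic before the next sweep
background_pass(entries, free_pool)   # queue 0 is released, queue 2 is forgiven
```

Requiring two sightings avoids releasing a queue that is only momentarily empty between packets.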

Special function units: - The 'flag unit' and 'address unit' are special function units designed to support the find free
queue algorithm. The features they provide are considered to be of generic value and could be used by other
algorithms (such as that required for maintaining meters in state elements).

Scheduling algorithm: - The information required by the Self-Clocked Fair Queueing algorithm cannot be mapped
directly into the state element. It is represented in a form which makes access and manipulation more robust and
efficient. Is this a claim relevant to the state element itself or the software using the state element? (see ref [2]).
Reference material

[1] A. Spencer "State Engines - a design framework"

- Original design work for ClearSpeed State Engine solution
[2] A. Spencer "Per-Flow Traffic Handler"

- Original design work for ClearSpeed Traffic Handling solution



4.6 Key features of the invention 

• Intelligent memory - The state element localises the serialisation of parallel data accesses at the memory 
end, not the processor end. This greatly reduces the latency commonly associated with the blocking of 
state. 

• Functional versatility - The state element provides a number of (configured/programmed) remote functions
which may be performed on the stored data. These functions would include data read, write, arithmetic
operations and conditional accesses.

• Flexibility - The functions can (but need not) be expressed in microcode so that the state element remains
programmable and does not 'tie' software executing on the processor to functions hardwired into the state
element.

• System efficiency - The read/writeback occurs between the ALU and the memory inside the state element.
Only the command travels across the system bus. This reduces the burden on the system bus as
compared with conventional approaches.

• System simplicity - The read/modify/write is encapsulated within the state element and serialisation is
inherently enforced by the state element logic. Processors can simultaneously issue commands which will
cause a function to act on the same item of state without having to first negotiate with one another.



4.7 Scope of the claim 

The problem was identified while using MTAP processors to access shared state in a Traffic Handling application, 
and state elements were conceived to resolve the contention issue. 

It is recognised that contention is not an issue exclusive to Traffic Handling, therefore state elements could be used 
as a general purpose tool in support of MTAP processors in any application. 

Most broadly, contention can arise when any two processors in a realtime (data flow processing) system require
R/M/W access to a shared state variable. State Elements could therefore be used in conjunction with any parallel
or pipelined arrangement of processors.
5. The state engine
5.1 Background 

Situations often arise whereby functions must be performed on a continuous stream of data. If the functions are
implemented in software on a processor, then each datagram (packet of data) which arrives in sequence from the
stream must be stored, processed and then forwarded. This process will take some finite quantity of time to
execute. As the rate of packet arrival increases there will come a point at which a single processor can no longer
keep up. The function must then either be distributed across multiple processors arranged in a pipeline, or across
multiple processors arranged in parallel - each receiving a packet from the stream in turn in some round robin
sequence. Packets output from parallel processors are typically reordered before forwarding.

This is a well proven approach to high performance packet processing, but is limited in its scalability as the number 
of processors increases. Access to shared memories, be it for code or data, eventually becomes a bottleneck. Si- 
multaneous R/W access to shared state will further add to the complexity of system control signalling in order to 
resolve contention. 

MTAP processors resolve traditional issues relating to instruction lookup, and State Element technology supports
parallel processing systems by localising and managing serialisation to shared state. (Both technologies are
provided by ClearSpeed Technology Ltd). This leaves the issue of high speed access to multiple items of shared state
information by multiple parallel processors. As the number of processors and the complexity of their algorithms
increases, address and data bandwidth requirement over the system bus to the shared data will also increase.
This can then become a bottleneck.

A good case in point is the challenge of Traffic Management in network routers. A significant, recognised issue in
per-flow Traffic Handling is that a number of items of state need to be maintained for each of a large number of
queues. The implications of this are that (a) a considerable volume of shared memory needs to be implemented,
(b) a lot of memory address bandwidth is required if each queue requires that separate accesses be made to
different (shared) state variables, (c) the memory access latency is likely to be long, thus causing state blocking
during modification to impact on performance.



5.2 The problem and prior art 

Contention for shared state variables can be resolved by implementing state elements as described in reference
[1] and in the 'State Element' patent proposal. However, the successful implementation of the state element
concept in high performance systems requires additional innovation to overcome the following:

1. Arranging processors in parallel can create a high rate of access to the same item of state.

2. What if a given function needs access to multiple variables at the same address? In other words,
what if it needs to access and process a state record?

3. What if multiple functions executing in a processor on a single datagram each require access to different,
independently addressable tables of state variables or records?

In short, the fundamental problem being addressed is that of a high rate of state access. This problem must be
solved in a flexible way which enables the easy scaling of both the quantity of state being stored and the rate of
state access.



5.3 Summary of the invention 

A formal framework for designing an active state storage system using state elements 
5.4 List of attached figures




Figure 5.1 Conceptual design hierarchy of the state engine
Figure 5.2 Functional view of a state engine
Figure 5.3 Implementation of the state cell







Figure 5.4 The Queue State Engine - an implementation example of a complex state engine design for Traffic
Handling and queue management (labels: output state array, insertion (system command), system controller, free queue pool, bus interface (requests), extraction (system command), utility bus interface, bus interface (responses), on-chip bus)

5.5 Detailed description 



Description of the concept and invention 

A state engine can be built up in a structured and well defined manner using the state element as an atomic part. 
Just as atoms are the components of molecules, which may be the building blocks of simple cells, which then com- 
bine into simple organisms - state elements are combined into state cells, which are multiplied into state arrays, 
which may be grouped together to form state engines. 

This hierarchical design framework is illustrated in figure 1. The component parts shown are:



State Record - This is a conceptual entity only. It is a group of one or more state variables which share a common
address.



Command line - A message sent by a processor to the state engine. Fields in the command line include command 
code, address and data. The processor is effectively requesting that the function indexed by the command code 
be performed on the state record at the given address. Parameters can be both supplied and returned in the gen- 
eral purpose data field. 

State Element - As described in reference [1] and the sister "State Element" patent, a state element is a small,
private memory which contains state variables accessible only via functions executed by the state element's
control logic. Functions typically read a state variable, perform some modification and write a new value back. The
result may also be recorded in a data field in the command line. The primary role of the state element is to manage
the state access serialisation point by executing a simple function on memory at maximum speed.

State Cell - If there is more than one state variable in a record, it is permissible for the entire record to be stored
as an entry within a single state element. However, as each field in the record would need to be processed in turn
this would throttle the available bandwidth to the state. In the State Cell each field of the record is stored as a single
state variable in its own State Element. These State Elements are then chained together in a pipeline. The
command line passes from one Element to the next, the same address and control word being used at each stage to
pick a different field from a common record and perform some function on it. State cell logic provides
synchronisation between its constituent Elements, which effectively make up a memory oriented pipelined
processing system.

The primary role of the State Cell is thus to provide a means of constructing simple, pipelined processors which 
enable more complex state records to be handled at high speeds. 
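The chaining of State Elements into a State Cell can be modelled functionally as below. This is a sketch under assumptions: the field names, command codes and the dictionary used for the general purpose data field are illustrative only, and the sequential loop models what each stage does to a command line rather than the overlap of different command lines in different stages that a hardware pipeline would provide.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CommandLine:
    code: str
    addr: int
    data: Dict[str, int] = field(default_factory=dict)   # general purpose data field


class FieldElement:
    """One State Element holding a single field of every record."""

    def __init__(self, name: str, size: int, routines: Dict[str, Callable]):
        self.name = name
        self.mem = [0] * size
        self.routines = routines

    def step(self, cmd: CommandLine) -> CommandLine:
        routine = self.routines.get(cmd.code)
        if routine is not None:
            self.mem[cmd.addr] = routine(self.mem[cmd.addr], cmd.data)
            cmd.data[self.name] = self.mem[cmd.addr]      # result travels with the command
        return cmd


class StateCell:
    def __init__(self, elements: List[FieldElement]):
        self.elements = elements

    def execute(self, cmd: CommandLine) -> CommandLine:
        for element in self.elements:    # same address and code at every stage
            cmd = element.step(cmd)
        return cmd


cell = StateCell([
    FieldElement("packets", 8, {"ENQ": lambda v, d: v + 1, "DEQ": lambda v, d: v - 1}),
    FieldElement("bytes", 8, {"ENQ": lambda v, d: v + d["size"], "DEQ": lambda v, d: v - d["size"]}),
])
cell.execute(CommandLine("ENQ", 5, {"size": 100}))
cell.execute(CommandLine("ENQ", 5, {"size": 60}))
out = cell.execute(CommandLine("DEQ", 5, {"size": 100}))
```

Each field of the record at address 5 lives in a different element, so one command line updates the whole record while touching each memory only once.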

State Array - The embedded memory used in the State Elements of State Cells must be relatively small in volume
for rapid (ideally single cycle) access. This places a limit on the number of instances of a state record which may
be stored in a single State Cell. To increase the quantity of state, State Cells of a given type can be tiled to form
a large State Array. Scaling during device layout is simplified by the State Array interconnect. The segmentation
of an interconnection framework and the coupling of adjacent Cells in a tiled array using well defined interfaces is
shown in the accompanying figure. The interconnect preserves order between accesses to the same State Cells.
Since order preservation amongst command lines accessing different State Cells is not required, there is no need
for the latency of command line accesses to different Cells across the array to be balanced. The Array is scalable
in a simple way and is layout-friendly.

Increasing the total state storage volume by multiplying State Cells can also increase overall state access
bandwidth, as the throughput of an individual State Cell is likely to be a little lower than that of the interconnect.
If the number of State Cells is increased to the point that the interconnect becomes the limiting factor then
aggregate throughput can be further increased by providing multiple interconnect channels - each channel accessing
a different portion of the array (ie. table). This is analogous to designing a memory system with multiple,
independently addressable channels to increase random access bandwidth.

The primary role of the state array is to provide scalable capacity. It also provides a means for scaling address 
and data bandwidth. 
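
As a rough illustration of how tiling scales capacity, the Python sketch below splits a flat record address into a cell index and an offset within that cell. The text does not specify the mapping; the divmod scheme and the names StateArray, locate and add are assumptions made for the example.

```python
class StateArray:
    """Tiled cells: capacity grows with the number of cells."""

    def __init__(self, num_cells, cell_depth):
        self.cell_depth = cell_depth
        self.cells = [[0] * cell_depth for _ in range(num_cells)]

    def capacity(self):
        return len(self.cells) * self.cell_depth

    def locate(self, address):
        # Assumed mapping: upper part of the address selects the cell,
        # the remainder selects the entry inside it.
        return divmod(address, self.cell_depth)

    def add(self, address, amount):
        cell, offset = self.locate(address)
        self.cells[cell][offset] += amount
        return self.cells[cell][offset]


array = StateArray(num_cells=4, cell_depth=256)
where = array.locate(600)
total = array.add(600, 10)
```

Accesses to the same address always land in the same cell, so ordering only has to be preserved per cell, which matches the ordering requirement stated above.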

State Engine - The State Engine combines State Arrays with all the additional glue logic and facilities which are 
required to construct a block which can be configured and accessed via a system bus. Components include: 

• Bus interface logic 

• System control logic - The state engine controller may issue (private) system commands to the state 
arrays. These commands are invoked by external blocks through accesses to the controller via the utility 
bus interface. Only (public) state commands may arrive via the main data flow interfaces. System com- 
mands configure the arrays or extract diagnostic information. 

• Bypass logic - Bypass modes enable commands to skip arrays which they are not required to access. This 
will conserve power and bandwidth. The required extraction and insertion points can also be used by the 
system controller. 

• Inter array switch connectors - Novel(?) application of (Banyan) switching technology for routing accesses
between tables. Only required when there is more than one independent route through each State Array.

State Engine behaviours include: 

• Message broadcasting - System commands can be broadcast throughout the memory arrays for retrieving
status or passing configuration and control messages. This method is also used for loading microcode into
state arrays.

• Multiple accesses - If multiple arrays are connected in a pipe then it is evident that each command line
must contain different address and command information for each array. A single command issued from
the processor thus results in multiple state accesses.

• Command line "morphing" - As command lines propagate from array to array they are used and sometimes
updated as a result of each state element access. The data inserted into the command by state elements
in one array could be used by the state elements in the next. Data and perhaps even addresses
could be modified.
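
The "multiple accesses" and "morphing" behaviours can be sketched in Python as a command line passing through two arrays in series. The two tables used here (a stream-to-queue lookup followed by a per-queue byte counter) and all field names are hypothetical, chosen only to show data written by one array being used as an address by the next.

```python
def lookup_stage(table):
    """First array: translate a stream label into a queue identifier."""
    def stage(cmd):
        cmd["queue_id"] = table[cmd["stream"]]  # result inserted into the command line
        return cmd
    return stage


def counter_stage(counters):
    """Second array: addressed by the value the previous array inserted."""
    def stage(cmd):
        qid = cmd["queue_id"]
        counters[qid] += cmd["size"]
        cmd["queue_bytes"] = counters[qid]
        return cmd
    return stage


def run_pipeline(stages, cmd):
    # A single command issued by the processor results in one access per array.
    for stage in stages:
        cmd = stage(cmd)
    return cmd


stream_table = {"flow-a": 2, "flow-b": 0}
byte_counters = [0, 0, 0, 0]
pipeline = [lookup_stage(stream_table), counter_stage(byte_counters)]
reply = run_pipeline(pipeline, {"stream": "flow-a", "size": 100})
```

The processor issues one request and receives one reply that carries the results of both accesses.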



Details of the embodiment

Example state engine architectures are documented in the "State Engine design framework" document [1].

The design of the state element is detailed in the "State Element" patent.

The design of the State Cell which stitches Elements together in a pipeline is shown in figure 3. 

The architecture of the State Array, and the interconnection of State Engine components is illustrated by figure 4. 
This shows a three-array instance of a Queue State Engine which supports per-flow Traffic Handling. 

Additional related design work 

Load balancing - It is possible that state records may be allocated dynamically on demand (and also deassigned).
If multiple paths exist through a given array then it is desirable for the stored state to be spread evenly across the
available State Elements/Cells. The availability of state entries in such a system could be advertised by the
Controller in such a way as to ensure that records are assigned from each Element in turn, thus balancing the load.

System threads: - 
Reference material 

[1] A. Spencer "State Engines - a design framework" 

- Original design work for ClearSpeed State Engine solution 
[2] A. Spencer "Per-Flow Traffic Handler" 

- Original design work for ClearSpeed Traffic Handling solution 



5.6 Key features of the invention 

All of the specified issues associated with high speed data lookup by parallel processors are addressed: 

• A formal framework for creating a parallel coprocessor using smart memory (state elements). 

• Single access, multiple lookups - A single access acts upon multiple, independent state tables within the 
state engine, ie. multiple lookups into different tables held in different memories as a result of a single 
request from the bus. 

• Pipelined architecture - Lookups into different tables are not fired off from a point source into different
memories. Instead, the access itself (in the form of a command line) is routed from table to table in a serial
fashion. It is an object which travels through the QSE.

• Command line "morphing" - As command lines propagate along the pipe from table to table they are used 
and sometimes updated as a result of each table access. The data inserted into the command by one table 
could be used by the state elements in the next. 

• State cell concept - high throughput pipelined processing (scalable processing power) 

• State array concept - 'layout friendly' scheme for scaling quantity of state, bandwidth and load balancing 
• State engine concept - multiple orthogonal lookups from a single command; uses switching technology for
multi-lane state engine architectures. Controller provides system commands for data and instruction
broadcasts.

5.7 Scope of the claim 

The problems were identified while using MTAP processors to access shared state in a Traffic Handling
application. State engines were conceived as a way to arrange the state elements (required for managing state
contention) in a way that addressed the additional issue of a high rate of state access.

State Engines can also be architected from the same or similar state elements to meet the needs of other appli- 
cations - for instance meter management in the related area of Traffic Conditioning. It is therefore speculated that 
state engines could be used to deliver state element technology to any other application in which parallel (or even 
pipelined) processors are accessing shared state at high rates. 
6. Programmable orderlist manager



6.1 Background 

A basic understanding of router anatomy and traffic handling is assumed. 

In traffic handling, packets may be placed in one of a number of queues. With more than one queue present, a
scheduling function must determine the order in which packets are served from the queues. The scheduled order 
is determined principally by the relative priorities that the scheduler places on the queues - not on the order in 
which packets arrived at the queues. The scheduling function is thus fairly serial in character. 

For example, consider the two popular scheduling methods: 

1. Fair Queue scheduling - every packet in the queue is given a finish number which indicates the relative
point in time that the packet is entitled to be output. The function that serves packets from the queue must
identify the queue whose next packet has the smallest finish number. Ideally, only after the packet has
been served and the next packet in the same queue been revealed can the dequeueing function make its
next decision.

2. Round Robin scheduling - Queues are inspected in turn in a predetermined sequence. On each visit a pre- 
scribed quota of data may be served. 
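
The serial character of both methods can be seen in a short Python sketch. It is illustrative only: finish numbers are supplied as inputs rather than computed, the round robin quota is counted in packets rather than in units of data, and the function names are hypothetical.

```python
import heapq
from collections import deque


def fair_queue_order(queues):
    """queues: name -> list of (finish_number, packet) in arrival order."""
    pending = {name: deque(items) for name, items in queues.items() if items}
    # Only the head packet of each queue is visible to the scheduler.
    heap = [(q[0][0], name) for name, q in pending.items()]
    heapq.heapify(heap)
    served = []
    while heap:
        _, name = heapq.heappop(heap)            # queue with the smallest head finish number
        _, packet = pending[name].popleft()
        served.append(packet)
        if pending[name]:                        # next packet in that queue is now revealed
            heapq.heappush(heap, (pending[name][0][0], name))
    return served


def round_robin_order(queues, quota):
    """queues: name -> list of packets; quota: packets served per visit (assumed unit)."""
    pending = {name: deque(items) for name, items in queues.items()}
    served = []
    while any(pending.values()):
        for name in pending:                     # predetermined visiting sequence
            for _ in range(quota):
                if pending[name]:
                    served.append(pending[name].popleft())
    return served


fq = fair_queue_order({"a": [(1, "a1"), (4, "a2")], "b": [(2, "b1"), (3, "b2")]})
rr = round_robin_order({"a": ["a1", "a2", "a3"], "b": ["b1"]}, quota=2)
```

In the fair queue case each decision depends on the outcome of the previous one, which is why the function is hard to parallelise directly.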



6.2 The problem and prior art 

The fundamental problem is how to perform such scheduling algorithms at high speeds. A serialised process can
only scale with clock/cycle frequency, or by increasing the depth of the processing pipe which makes the
scheduling decision. This approach to scaling runs out of steam at 40 Gbits/s line rates when the available silicon
processing technology may only be able to provide a couple of system clock cycles per packet.

On top of this, the scheduling and queue management task is further confounded by a requirement for a large 
number of potentially very deep queues. Hardware which executes the scheduling function in a serial manner is 
then likely to be highly customised and therefore inflexible if it is to meet the required performance. 

6.3 Summary of the invention 

A system for maintaining ordered logical data structures in software at high speeds 

6.4 List of attached figures

Figure 6.1 Concept of orderlist management using a bin sort approach

Figure 6.2 Functional overview of an orderlist management system

6.5 Detailed description

Introduction

[...] into an output queue, sort packets on arrival directly into the output queue, [...]
potentially huge ordered data structure.

However, this approach enables parallelism to be exploited in the implementation of the solution. When
performance can no longer be improved through brute force in a serialised solution, the way forward is to find an
approach which can scale up through its parallelism.

The "Programmable 40 Gbits/s Traffic Handler" proposal describes an innovative parallel processing architecture
[...]
other half of the problem - the maintenance of a large orderlist at high speed.

Description of the concept and invention 
Figure 1 shows the basic concept of bin sorting: 

• Consider a small set of bins. Each bin is used to contain packets with a certain range of finish numbers.
The content of a bin is not ordered.

• A function is required which receives packets and places them in the appropriate bin.

• Another function is required which reads the content of each bin in turn in ascending order of the finish
number ranges.

• Assume that just as bins are emptied at one end of this sequence new bins are installed at the other, as
packets arrive with finish numbers which are, on average, constantly increasing in value.

• Thus, a stream of packets is arriving with randomly ordered finish numbers. These are sorted into bins. A
stream of packets is output in a coarsely sorted order which depends on bin size.

• The final stage bins can relatively easily be sorted into actual order for output.

• [...] The functions could be mapped into processors and the pointers into a state memory.

• A data structure is proposed which comprises more than one set of bins. Within a set of bins the finish
number range is constant, but between sets [...] range of a single bin in an adjacent set.

• When a bin is emptied, it is sorted into the next set of bins.

• Either this is repeated until the finish number range of the final set of bins is unity, OR when the smallest
bins are emptied they are subject to a final sort before forwarding in order.
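
The iterative binning concept can be sketched in Python as follows. This is a batch model written under stated assumptions rather than the streaming system described: the bin widths [100, 10, 1] are arbitrary, each set is assumed to subdivide one bin of the previous set, and the name iterative_bin_sort is hypothetical.

```python
from collections import defaultdict


def iterative_bin_sort(finish_numbers, widths):
    """widths: bin width for each set of bins, coarsest first, ending at 1 (unity)."""

    def drain(items, level):
        if level == len(widths):
            # Bins of width 1 hold equal finish numbers, so no further sorting is needed.
            return list(items)
        bins = defaultdict(list)
        for value in items:
            bins[value // widths[level]].append(value)  # content of a bin is unordered
        ordered = []
        # Bins are emptied in ascending order of their finish number range; sorted()
        # stands in for walking a fixed sequence of bins.
        for index in sorted(bins):
            ordered.extend(drain(bins[index], level + 1))  # re-bin into the next set
        return ordered

    return drain(finish_numbers, 0)


arrivals = [205, 17, 113, 12, 199, 100]
output = iterative_bin_sort(arrivals, widths=[100, 10, 1])
```

Placing a packet in a bin needs only its own finish number, so the binning work can be shared between parallel processors; only the emptying of bins is ordered.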

Details of the embodiment

Figure 3 shows how MTAP processors can be used to implement an orderlist manager. The numbering shows the
sequence of events which occur as packets are scheduled, binned, re-binned, sorted and output etc. (A full
walkthrough could be provided if necessary.)

• When MTAP processors are arranged in a data flow processing architecture they are well suited to the 
processing of a high speed stream of packets. They naturally operate by performing batch reads of data 
[1], doing some processing, and then pushing data out onto queues. 

• State Engines used as hardware accelerators can enable the MTAP processors to store and manage the
logical state required for the bins.

• The bins are most conveniently implemented as LIFO stacks. This minimises the required state per bin
and simplifies the management of bins as linked lists in memory.

• When each packet is stored in a bin its location in memory is retained in the state engine. This can be used 
as a pointer by the next packet which needs to be written to the same bin. Each bin is thus a stack in which 
each entry points to the next one down. 

• A databuffer block is used to store the bins. The block contains a bin memory and presents producer and
consumer interfaces to the processor [1]. The consumer receives a stream of packets and simply writes
them to a supplied address. The producer receives batch read requests from the processor and outputs
data from the requested bin.

• As each bin is organised as a linked list, it is the responsibility of the producer to extract the linked list 
pointer from each packet as it is read from the bin. Using SRAM the access time should be fast enough to 
make this serialised process efficient. 

• In a real system embodiment it is not necessary to store the actual packets in the bins. Small records which
represent packets can be processed in their place. This is described in [2]. This simplifies implementation
as the bins now store small entities (records) of fixed size.
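
The bullets above describe each bin as a LIFO stack threaded through memory as a linked list, with only the top-of-stack pointer held as per-bin state. A minimal Python sketch under that reading follows; the class BinMemory and its method names are hypothetical, and a dictionary stands in for both the bin memory and the state engine.

```python
class BinMemory:
    """Record memory in which each bin is a LIFO stack threaded as a linked list."""

    def __init__(self):
        self.memory = {}  # address -> (record, pointer to the next entry down)
        self.head = {}    # per-bin state: address of the top entry (state engine role)

    def push(self, bin_id, address, record):
        """Consumer side: write the record at the supplied address."""
        # The previous top of the stack becomes this entry's link pointer.
        self.memory[address] = (record, self.head.get(bin_id))
        self.head[bin_id] = address

    def read_bin(self, bin_id):
        """Producer side: batch read, following the link pointer in each entry."""
        address = self.head.pop(bin_id, None)
        records = []
        while address is not None:
            record, address = self.memory.pop(address)
            records.append(record)
        return records


bins = BinMemory()
bins.push(bin_id=0, address=10, record="r1")
bins.push(bin_id=0, address=11, record="r2")
bins.push(bin_id=0, address=12, record="r3")
first_read = bins.read_bin(0)
second_read = bins.read_bin(0)
```

Only one pointer per bin has to be kept as state, and a write never has to walk the list, which is the benefit claimed for the LIFO arrangement.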



Additional related design work 

On-demand load balancing: The MTAP processors are split between the enqueueing (scheduling) task and the
dequeueing (final sort) task. A sufficient number of processors must be implemented in order that they can cope 
with the transient worst case rate of packet arrival. However, the nominal arrival rate is much lower. This would 
mean that a number of processors could routinely lay idle or be underused. The proposal is that a small number 
of processors are assigned permanently to either the enqueueing or dequeueing tasks. The remainder may float. 
If input congestion is detected then the floating processors thread switch and assist in the enqueueing task. When 
the congestion is cleared, the floating processors migrate to the dequeueing task and help to clear the backlog in 
the queues. If dequeuing is well resourced, then floating processors may default to peripheral tasks such as sta- 
tistics pre-processing for subsequent reporting to the control plane. 
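
The floating processor policy can be expressed as a small Python decision function. It is a sketch only: the processor counts, the two boolean congestion indicators and the name assign_processors are assumptions, and the real thread switching mechanism is not modelled.

```python
def assign_processors(total, fixed_enqueue, fixed_dequeue, input_congested, dequeue_backlog):
    """Return how many processors work on each task for the current conditions."""
    floating = total - fixed_enqueue - fixed_dequeue
    if floating < 0:
        raise ValueError("more fixed processors than processors available")
    roles = {"enqueue": fixed_enqueue, "dequeue": fixed_dequeue, "peripheral": 0}
    if input_congested:
        roles["enqueue"] += floating     # help absorb the transient arrival burst
    elif dequeue_backlog:
        roles["dequeue"] += floating     # help clear the backlog in the queues
    else:
        roles["peripheral"] += floating  # e.g. statistics pre-processing
    return roles


burst = assign_processors(16, 2, 2, input_congested=True, dequeue_backlog=True)
recovery = assign_processors(16, 2, 2, input_congested=False, dequeue_backlog=True)
quiet = assign_processors(16, 2, 2, input_congested=False, dequeue_backlog=False)
```

Input congestion takes priority over the dequeue backlog in this sketch, reflecting the order in which the text describes the floating processors migrating.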

Shadowed memory management: This is an essential element of the orderlist management system solution.
Although simple, I felt it might be sufficiently valuable as an idea to describe it separately. Any given data structure
needs functions to read and write entries, logical state to characterise the structure, and underlying memory
management to efficiently store the structure. The MTAP processor and accelerator only achieve the first two of
these. No mention has yet been made of maintaining a freelist of available memory and allocating memory for the
data structure to grow into. This in itself can often incur considerable overhead. The efficiency of the orderlist
manager is only possible because the memory management has already been performed for it as follows:

Background: 

• In traffic handling it is practical to divorce the packet buffering from the processing task. As described
in the 40G programmable Traffic Handler proposal, packets are stored in memory within the packet buffering
system. Small records are passed to the processing system which efficiently manipulates records in
place of the packets they represent.

• The packet memory is partitioned into small blocks [...]. A free list or bitmap is maintained which
keeps track of which blocks are allocated and which are free.

. IZTJto*. ft. record - oonftln ft. ™ W ««~. - « <W> — - »— » " «*» »» 
packet is stored. 
Packet record handling and storage 

• !£™«ftX^^ 

. ZHSTTii manipulation are . » - «. ~ »dTAPa ».pao..a,. Bo — , 

management relies directly on the packet memory management. 
Bin memory management concept

and recovered, the system is very robust. 
T^a an, random* «o»d «Hhft ft. rooord m.mo,y. ft. -aoo-do balomjftg ft . .W bin X point 

and 'Next_B'. It can be seen that 'Next_A' is the same as 'Self_B'. [...]
provides a considerable reduction in the record storage requirement.

• Two storage systems can share the same memory manager when the write/read accesses to one are
nested within the write/read accesses to the other.




age. 



reference [3] for further details. 

Reference material

[1] ClearSpeed "Network Processor architecture document - V1.0"

- Architecture proposal for ClearSpeed 40G Network Processor
[2] A. Spencer "Traffic Handler architecture document - V0.1"

- Architecture proposal for ClearSpeed 40G Traffic Handler (in progress)
[3] K. Cameron "Software Queueing for Traffic Management"

- Discussion document

6.6 Key features of the invention 

• A single orderlist is used instead of multiple queues.

• A method of iterative binning is described which makes the management of large orderlists very efficient.

• Orderlist management is performed entirely in software using MTAP processors and accompanying state
engines.

• The processing and state resource can be partitioned to provide either single or multiple orderlists.

6.7 Scope of the claim 

The claim is considered to be original on two fronts: 

Firstly, it is an alternate way of queueing data. Packets (or packet records) are queued directly into an orderlist
instead of in separate per-stream queues.

Secondly, it is unlikely that MTAP processors have been used previously to perform the queue management -
either in the process of making the binning decisions, or in the use of state engines for bin pointer management.

Because the invention relies on binning and data structure management in software, it is also speculated that
alternative data structures could be mapped into the hardware resources and managed by different software
processes. This implies that the invention could have broader application beyond that of supporting Traffic Handling.

7. Overlapped virtual queueing



7.1 Background 

Some prior understanding of Traffic Management, Traffic Handling and (virtual) queueing would be a benefit.

Per-flow traffic handling requires the independent queueing of traffic belonging to hundreds of thousands, if not 
millions of different connections. Virtual queueing is a text book approach in which a limited number of physical 
queues are shared dynamically between a much larger number of connections. In virtual queueing a queue is as- 
signed from a common pool to a stream (flow or flow aggregate) when that flow becomes active. Conversely, if 
an assigned queue remains empty for a given duration, it is effectively inactive - an unused resource which should 
be returned to the pool. Observing that a finite memory volume limits the number of packets which may be buff- 
ered at any given time, it then follows that a finite pool of shared physical queues can be used to support a much 
larger number of end-to-end connections as the connections can not all simultaneously have traffic backlogged in 
the traffic handling system. 

So, the idea is that if a connection does not actually have traffic backlogged in a queue at a given point in time, 
then its queue is unused and might just as well not exist. Queues are allocated and deallocated on a per-demand 
basis. A queue is only assigned to a connection when that connection has backlogged traffic. 
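
Conventional virtual queueing, as summarised above, can be sketched in Python as a pool of physical queues assigned on demand. The sketch simplifies one point and labels it as such: a queue is returned to the pool as soon as it empties, rather than after remaining empty for a given duration. The class and method names are hypothetical.

```python
from collections import deque


class VirtualQueues:
    """A small pool of physical queues shared between many connections."""

    def __init__(self, num_physical):
        self.pool = deque(range(num_physical))  # free physical queue identifiers
        self.assigned = {}                      # connection -> physical queue id
        self.queues = {q: deque() for q in range(num_physical)}

    def enqueue(self, connection, packet):
        if connection not in self.assigned:
            if not self.pool:
                return False                    # no free queue: caller must drop or defer
            self.assigned[connection] = self.pool.popleft()
        self.queues[self.assigned[connection]].append(packet)
        return True

    def dequeue(self, connection):
        qid = self.assigned[connection]
        packet = self.queues[qid].popleft()
        if not self.queues[qid]:
            # Simplification: release immediately instead of after an idle period.
            del self.assigned[connection]
            self.pool.append(qid)
        return packet


vq = VirtualQueues(num_physical=1)
first = vq.enqueue("c1", "p1")
blocked = vq.enqueue("c2", "p2")
served = vq.dequeue("c1")
retry = vq.enqueue("c2", "p2")
```

With a single physical queue, the second connection is refused until the first connection's backlog has drained, after which the same physical queue serves it.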



7.2 The problem and prior art 

The implementation of virtual queueing presents its own problems: 

1. How are queues deassigned? This could be tricky if there are a number of points in the handler at which
packets belonging to a given connection could be buffered.

2. How are queues assigned to new connections? Packets belonging to new connections could appear and
make an on the spot request for a queue.

3. The purge that is necessary in between a queue being deassigned and it being assigned to a new
connection requires significant system wide messaging and state synchronisation.

This last issue is the core focus of the ClearSpeed "Overlapped Virtual Queueing" concept. 

Consider the simple high level view of traffic handling behaviour illustrated in figure 1.

Stream labels attached to packets arriving at A are used to look up a destination queue identifier in B. This queue
identifier is then used at C to access a table of queue state in D, thus enabling the packet to be appropriately
enqueued in E. Subsequent de-queueing by F must also access the queue state associated with each packet when
it is served.

There is a close relationship between the information held in B and the organisation of state in D. In the context
of virtual queueing these relationships are more specifically:

• Lookup entries in B must be created when packets with unrecognised stream labels arrive at A. These
entries must point to an available Q-state entry in D.

• Based on accesses from both C and F, D must be capable of determining (a) whether a queue is empty,
and (b) whether the queue is eligible to be de-assigned and returned to the pool in B.

• When D de-assigns a queue, then the related entry in B must be removed.

The implementation of these behaviours should address the problem of pipelining effects caused by
packet/message buffering within all links shown in the figure - ie. packets of a newly assigned/deassigned queue
can persist in buffers between A, C and F. The virtual queueing solution must also impose minimal overhead on the
system in terms of logical complexity, messaging bandwidth and the storage of additional state.

7.3 Summary of the invention 

A low overhead method for setting up and tearing down virtual queues 

7.4 List of attached figures 




Figure 7.1 Simple schematic view of traffic management

Figure 7.2 Queue lookup table

7.5 Detailed description 

Introduction

Conventionally, one might deassign a virtual queue from an inactive connection, wait for any residual packets for
that connection to be purged from the queues, and then reassign the queue to a new connection. This is possible
but can require a lot of control signalling, additional state and synchronisation across the system. Overlapped
virtual queueing eliminates the purge phase and makes deassignment and reassignment simultaneous.

Instead of a queue being either assigned or unassigned to a connection, it is either assigned or pending
re-assignment. Only after boot-up might a queue actually be unassigned. This means that a queue normally always
belongs to a connection.

In the pending reassignment state it may still be used by the old connection (that is, assuming the previously
inactive connection suddenly comes back to life!).

At the moment of reassignment a new connection takes ownership of the queue and the old connection may no
longer place any further packets in it. The old connection will be granted a new queue as and when further packets
arrive.

Deassignment is thus implicit in reassignment - there is no explicit messaging involved.

[...] connection with old packets [...] the traffic handling pipeline, those packets are left alone and are [...]
queued.



Description of the concept and invention

Queue lookup:

[...] is illustrated in figure 2 and described as follows.

The table can be indexed in one of two ways:

1. An entry is identified by content addressing using the key.

2. The value (of length N bits) can be used to directly index the table of 2^N entries.

The required behaviour uses these addressing modes as follows:

• A key is presented to the table in order to look up a value and additional data. An exact match is required.

• An unrelated value is attached to each key used for the lookup. If the lookup is successful then the
attached value is returned with the results.

• [...] this new table entry and a NULL value in place of the attached value, [...]
in the additional data field.

• [...] to pool when the result is returned.

State management:

In order to support the queue lookup block, an additional function is required which can monitor the activity of
queues, identify idle queues, and pass their identities to the queue lookup's pool.

This function is most obviously associated with the [...] is not the primary focus of this [...]
the embodiment section next.

[...] each packet, referred to as Connection_status. [...] packets belonging [...]
queue in a robust manner. For instance, Connection_status could be used in the following ways:

1. Differentiating between the first packet of a new connection and residual packets of an old connection
[...] between the queue lookup and enqueueing logic at the time of reassignment. When the
queue monitoring function re-categorises an idle queue as "pending deassignment" it must mark that
[...]. [...] the first packet of a new connection will arrive with the Connection_status
flag set. This provides two clear reference points for the queue monitoring and enqueueing logic to work
[...]. In between, the [...] acting on the queue state functions can recognise
and conditionally handle any packet belonging to the old connection. For instance, the queue monitoring
function may [...] of queues which are pending reassignment. The enqueueing function may [...]
only when it sees the Connection_status bit set. The status is unaffected by
residual packets of the old connection.

2. Differentiating between the first packet of the new connection and residual packets of the old connection
which [...] queue structure. It is possible that residual packets which are backlogged in the
[...] upset the queue state when they are scheduled and the dequeueing function sends a
[...]. [...] queue [...] may be configured to deal differently with old and new packets.

If an idle stream does spring back into life in this way after its queue has been deassigned, then the moment
the queue is reassigned to a different stream, the old stream is simply assigned to a new virtual queue. There is
thus almost no overhead to virtual queue management and queue switching.
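
The overlapped scheme can be sketched in Python as a lookup table in which a queue keeps its old owner until the instant a new connection claims it. This is an interpretation written under stated assumptions, not the implementation of reference [1]: the class OverlappedLookup, the mark_idle call standing in for the queue monitoring function, and the choice to withdraw a pending queue from the pool when its old owner becomes active again are all hypothetical.

```python
from collections import deque


class OverlappedLookup:
    """Queue lookup in which deassignment is implicit in reassignment."""

    def __init__(self, num_queues):
        self.owner = [None] * num_queues         # None only occurs after boot-up
        self.by_key = {}                         # content addressing: key -> queue id
        self.pending = deque(range(num_queues))  # queues pending reassignment

    def mark_idle(self, qid):
        """Queue monitor: an empty, idle queue becomes pending reassignment."""
        if qid not in self.pending:
            self.pending.append(qid)

    def lookup(self, key):
        """Return (queue id, connection_status); status is set on a new owner's first packet."""
        qid = self.by_key.get(key)
        if qid is not None:
            if qid in self.pending:
                self.pending.remove(qid)         # assumption: old owner is active again
            return qid, False
        if not self.pending:
            return None, False                   # no queue available for a new connection
        qid = self.pending.popleft()
        old = self.owner[qid]
        if old is not None:
            del self.by_key[old]                 # old entry vanishes, no explicit message
        self.owner[qid] = key
        self.by_key[key] = qid
        return qid, True


table = OverlappedLookup(num_queues=1)
steps = [table.lookup("a"), table.lookup("a"), table.lookup("b")]
table.mark_idle(0)
steps.append(table.lookup("a"))  # old connection comes back to life
steps.append(table.lookup("b"))
table.mark_idle(0)
steps.append(table.lookup("b"))  # queue changes owner here
steps.append(table.lookup("a"))  # old connection must now wait for a new queue
```

No purge phase appears anywhere in the sketch: the only event that removes the old connection's entry is the arrival of a new connection that takes the queue.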

Details of the embodiment 

The implementation of overlapped virtual queueing is described in detail in reference [1].

For the queue assignment:

• Content Addressable memory is an ideal memory technology for the table memory.

• The memory is supported by a function which provides and recycles the 'free queue' identities.

For the state management:

• [...] is an ideal basis for the queue monitoring function. A background system function can be [...]

Reference material 

[1] A. Spencer "Traffic Handler architecture document - V0.1" 

- Architecture proposal for ClearSpeed 40G Traffic Handler (in progress) 

7.6 Key features of the invention 

• A queue [...] belongs to a connection [...] rather has the state of being either [...]

* b e&«^^ 

• Deassignment is implicit in reassignment - there is no explicit messaging involved.

" ttvely the P 3Ckets ° f the 0,d c °nnectio n) those packets are effec 

• [...] queue handover is managed in a controlled [...] minimise [...]

• [...] when they are both empty AND there is a demand for resource. There is not [...]

7.7 Scope of the claim 

This proposal is only relevant to Traffic Handling applications in which many (hundreds of thousands of) queues
are required. Specifically, this therefore relates to per-flow Traffic Handling.